Language trees and zipping.

Authors

  • Dario Benedetto
  • Emanuele Caglioti
  • Vittorio Loreto
Abstract

In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We apply the method to linguistically motivated problems and report highly accurate results for language recognition, authorship attribution, and language classification.
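The abstract does not spell out how the remoteness measure is computed, so the following is only a minimal sketch of one compression-based construction, not the authors' implementation. It assumes Python's `zlib` as the compressor, and the names `compressed_size`, `extra_cost`, `remoteness`, and the `probe_len` parameter are hypothetical. The idea illustrated: a short piece of text costs fewer extra compressed bytes when appended to a reference in the same language than when appended to a reference in a different one.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed when `probe` is appended to `reference`."""
    return compressed_size(reference + probe) - compressed_size(reference)


def remoteness(a: bytes, b: bytes, probe_len: int = 200) -> float:
    """Illustrative (assumed) remoteness score of text `a` from text `b`.

    The last `probe_len` bytes of `b` serve as a probe. The score compares
    the cost of encoding the probe after `a` with the cost of encoding it
    after the rest of `b`. Larger values suggest `a` is further from `b`.
    """
    if len(b) <= probe_len:
        raise ValueError("b must be longer than probe_len")
    probe = b[-probe_len:]
    b_reference = b[:-probe_len]
    cross = extra_cost(a, probe)           # probe encoded with a's statistics
    own = extra_cost(b_reference, probe)   # probe encoded with b's statistics
    return (cross - own) / max(own, 1)     # guard against a zero denominator


if __name__ == "__main__":
    english = ("the quick brown fox jumps over the lazy dog and runs away " * 40).encode()
    italian = ("la volpe veloce salta sopra il cane pigro e corre via " * 40).encode()
    print("english vs italian:", remoteness(english, italian))
    print("italian vs italian:", remoteness(italian, italian))
```

The sample strings are toy data chosen only to make the effect visible. With real corpora the probe length and the choice of compressor would need tuning, and `zlib` only looks back over a 32 KB window, so very long references are only partly used.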


Similar resources

Comment on "Language Trees and Zipping" arXiv:cond-mat/0108530

Every encoding carries a priori information if it represents any semantic information about the universe or an object. Encoding means a mapping from the universe to a string or strings of digits. "Semantic" is used here in the model-theoretic sense, i.e., the denotation of the object. If the encoding or strings of symbols is the adequate and true mapping of the model or object, and the mapping is recursive or compu...


Comment on "Language Trees and Zipping"

This is the extended version of a Comment submitted to Physical Review Letters. I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline. And finally, I review the literature, showing that the ideas of the Letter are not novel. I...


Extended Comment on Language Trees and Zipping

This is the extended version of a Comment submitted to Physical Review Letters. I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline. And finally, I review the literature, showing that the ideas of the Letter are not novel. I...


The recent Letter "Language Trees and Zipping" [1] suggests using standard

compression programs to solve a number of problems. Unfortunately, the ideas are well known, and the technique, tested on a standard problem, is at least a factor of three worse than a simple baseline. In particular, the ideas of this Letter are very well known in several fields of Computer Science, including Machine Learning and Statistical Natural Language Processing. This Letter is essentia...


Language trees, zipping and error estimation

A method was recently proposed to estimate distances between a pair of given texts. The distance estimation appeared to be reliable enough to infer a phylogenetic tree of languages, even though no error estimation was provided. This essay reviews the method and explains its application for inferring phylogeny on a collection of heterogeneous texts. An approach for estimating the confidence o...



Journal:
  • Physical Review Letters

Volume 88, Issue 4

Pages: -

Publication date: 2002